gep: standardizing behavior for invalid BackendTLSPolicy #3909
Signed-off-by: Norwin Schnyder <[email protected]>
geps/gep-1897/index.md
Outdated
For an invalid BackendTLSPolicy, implementations MUST NOT fall back to unencrypted (plaintext) connections.
Instead, the corresponding TLS connection MUST fail, and the client MUST receive an HTTP error response.
Additionally, the `Accepted` status condition of the BackendTLSPolicy MUST be set to `False` with the reason `Invalid`.
Looking at it again, it might make sense to introduce the ResolvedRefs condition for policies as well. However, I’m not sure whether it fits within the current schedule for graduating the BackendTLSPolicy.
It might be difficult to implement due to the distinct processes of policy application and certificate validation against the backend. It's a valid scenario that the BTP was accepted, and configuration properly propagated by the controller, but connectivity is broken due to certificate misconfiguration. All information needed to debug this issue should be passed by inspecting the BTP spec against the Service to which the Policy was applied.
Agree with @kl52752 that ResolvedRefs would be difficult to set for anything that was tied to dataplane/connectivity. On the other hand, ResolvedRefs seems like a useful concept on BackendTLSPolicy for the case that a CACertRef is invalid/points to something that doesn't exist.
But there can be a maximum of 8 CACertificateRefs on the BTP validation type, so we would also want to specify here whether a minimum of one wrong CACertificateRef causes Accepted to become false or ResolvedRefs to become false, or for either to reflect some degree of problems (1 of 4 CACertificateRefsInvalid?). Isn't this something we should rather leave as an implementation detail?
this topic is discussed in multiple threads :) https://github.com/kubernetes-sigs/gateway-api/pull/3909/files#r2230306410
To summarize: I think that ResolvedRefs should be included and used for certificates that do not exist or are not valid (where "not valid" means "does not contain certain keys", not "unwrap the certificate and check certificate properties").
Decoding the certificate should never be required for an implementation to be conformant. Implementations MAY unwrap if they wish and do additional error handling, but those errors should be in addition to the included error handling.
Expanding on Nick's thoughts here, I guess it would be good to have an explicit definition somewhere of the conditions that make a certificate invalid (eg.: ref does not exist, ref exist but does not contain the right keys, ref is not of secret type tls, etc). This would probably be good for conformance as well.
I still think that for some cases it would be good to say that controllers CAN validate the certificate content, as I can think of numerous cases where a secret with a tls.crt key that contains multiple certificates is malformed in some way (invalid PEM, etc.), which can cause the whole gateway to go down if it blindly uses the certificate content.
Again, not a MUST but probably a recommendation.
Agreed that adding some explicit examples of what makes a reference invalid is fine - but the second item - "ref exist but does not contain the right keys" requires decoding the Secret, so it can only be a MAY.
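The cases listed in this thread (ref does not exist, ref is of an unexpected kind, ref lacks the expected key) can all be checked without parsing certificate contents. A minimal Python sketch, assuming a hypothetical in-memory `store` that stands in for the cluster; the function name and the reason strings are illustrative, and the `Secret` kind and `tls.crt` key are taken from the comments in this thread rather than from any final spec text:

```python
from typing import Mapping, Optional, Tuple

# Hypothetical stand-in for the cluster: (kind, name) -> the object's data keys.
Store = Mapping[Tuple[str, str], Mapping[str, str]]


def check_ref(kind: str, name: str, store: Store,
              supported_kinds: Tuple[str, ...] = ("Secret",),
              required_key: str = "tls.crt") -> Optional[str]:
    """Return None when the reference is usable, else a reason string.

    The reason strings are illustrative, not Gateway API constants.
    The certificate bytes are never decoded or inspected.
    """
    if kind not in supported_kinds:
        return "UnsupportedKind"
    obj = store.get((kind, name))
    if obj is None:
        return "NotFound"
    if not obj.get(required_key):
        return "MissingKey"
    return None


store = {
    ("Secret", "good-ca"): {"tls.crt": "-----BEGIN CERTIFICATE-----..."},
    ("Secret", "empty-ca"): {"other": "x"},
}
```

The third check has to read the referenced object, which is the step one reviewer argues can only be a MAY; dropping it leaves a purely reference-level validation.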
trusted certificates, then the associated TLS connection must fail.
trusted certificates, the BackendTLSPolicy is considered invalid.
For an invalid BackendTLSPolicy, implementations MUST NOT fall back to unencrypted (plaintext) connections.
I think that "Invalid" isn't the right term here because BTP spec is validated by CEL, making an invalid BTP impossible.
If this refers to a hostname or certificate mismatch with backend configuration, please state that explicitly.
Besides, this case is already covered in an implementation-specific way:
On the question of how to signal that there was a failure in the certificate validation, this is left up to the implementation to return a response error that is appropriate, such as one of the HTTP error codes: 400 (Bad Request), 401 (Unauthorized), 403 (Forbidden), or other signal that makes the failure sufficiently clear to the requester without revealing too much about the transaction, based on established security requirements.
As we discussed off-channel, I had two cases in mind where I would consider a BackendTLSPolicy to be invalid, apart from the standard CEL validation or runtime errors:
- If the CertificateRef cannot be resolved or does not include a certificate (tls.crt), the BackendTLSPolicy is considered invalid.
- If WellKnownCACertificates is set to "System" and there are no system trusted certificates or the implementation doesn't define system trusted certificates, the BackendTLSPolicy is considered invalid.
My main concern in raising this is: how would the BackendTLSPolicy signal that the referenced resource in CertificateRef does not exist or cannot be resolved?
Could we create a new condition similar to the "ResolvedRefs" condition that already exists for listeners?
Yes that's exactly what I've suggested in #3909 (comment)
An example structure could look like this:
status:
  ancestors:
  - ancestorRef:
      group: gateway.networking.k8s.io
      kind: Gateway
      name: gw
    conditions:
    - type: Accepted
      reason: Accepted
      status: "True"
      message: BackendTLSPolicy is accepted
    - type: ResolvedRefs
      reason: InvalidCertificateRef | UnsupportedWellKnownCACertificates
      status: "False"
      message: (implementation specific error)
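The example structure above could be produced by a small helper. This is only a sketch: `build_ancestor_status` and its parameters are hypothetical, using `ResolvedRefs` as the reason in the success case is an assumption, and the failure reason is whatever the caller passes in.

```python
from typing import Any, Dict, Optional


def build_ancestor_status(gateway_name: str,
                          ref_error_reason: Optional[str] = None,
                          ref_error_message: str = "") -> Dict[str, Any]:
    """Build one entry of status.ancestors for a BackendTLSPolicy."""
    resolved = ref_error_reason is None
    return {
        "ancestorRef": {
            "group": "gateway.networking.k8s.io",
            "kind": "Gateway",
            "name": gateway_name,
        },
        "conditions": [
            {
                "type": "Accepted",
                "status": "True",
                "reason": "Accepted",
                "message": "BackendTLSPolicy is accepted",
            },
            {
                "type": "ResolvedRefs",
                "status": "True" if resolved else "False",
                # "ResolvedRefs" as the success reason is an assumption.
                "reason": "ResolvedRefs" if resolved else ref_error_reason,
                "message": "" if resolved else ref_error_message,
            },
        ],
    }


entry = build_ancestor_status("gw", "InvalidCertificateRef", "secret not found")
```

Keeping `Accepted` and `ResolvedRefs` as separate conditions mirrors the example: the policy can be accepted while a reference problem is still reported.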
+1 that anything that references other objects should include a ResolvedRefs condition.
@kl52752 has a point about the pre-existing text regarding failures to connect. I prefer this over an HTTP 5XX error response. What do you think @robscott @youngnick ?
On the question of how to signal that there was a failure in the certificate validation, this is left up to the implementation to return a response error that is appropriate, such as one of the HTTP error codes: 400 (Bad Request), 401 (Unauthorized), 403 (Forbidden), or other signal that makes the failure sufficiently clear to the requester without revealing too much about the transaction, based on established security requirements.
To go back to the original point, in this document we're defining an "invalid" BackendTLSPolicy to be one that either uses an unsupported feature (WellKnownCACertificates) or has zero valid CertificateRefs, and that's a bit different from "has syntactic errors", which CEL prevents.
To evaluate the current behavior of various gateway implementations, I performed a test using
Result:
Observations:
Conclusion:
Signed-off-by: Norwin Schnyder <[email protected]>
2a97e7b to 49457d6
Thanks @snorwin, this is a great reference point!
It seems like we should settle on Gateway as the ancestor as that's the thing that's truly unique. With that said, if all Gateways for a given
To clarify, is this only true when all refs are invalid, or is it also true when there's a mix of valid and invalid refs?
This seems correct to me and lines up with my related comment.
This feels like it could benefit from some conformance test coverage as I doubt they're the only ones doing this.
Interesting, maybe our docs aren't sufficiently clear on recommended status?
I believe spec requires HTTP 500, definitely worth some conformance test coverage to ensure consistency here.
I think you missed the rest of this sentence @robscott...
I agree with @snorwin and @robscott about a few things:
@youngnick, regarding
You do mean the BackendTLSPolicy, not the Route, right?
Yes, sorry. I was also thinking about partial Route invalidity.
As much as I'd like to see this get into v1.4.0, at this stage in the release cycle, I'd like to take the pulse of the community. For any implementations that pass current conformance tests, this change would require new conformance tests that they may not pass for v1.4.0. /hold
Okay, I guess I wasn't clear before, so I'll try to be more clear now. To me, adding the
We only get one shot at this transition, so it's better to do the work here to make it happen; implementations can catch up after the fact. In this case, it's better to front-load the big changes into one change than to do them piecemeal. For the details of the ResolvedRefs Condition, for some of the feedback:
I've added some specific comments to the file with some wording here. I think this is very close, and then folks can get on to writing more conformance tests to check these behaviors.
/label tide/merge-method-squash
Signed-off-by: Norwin Schnyder <[email protected]>
We talked about this on the community call today: There was general agreement on the call that in terms of WHAT is implemented, there's a preference for the mixed-mode where some misconfigured certificates are tolerated, and statuses to reflect the misconfiguration are propagated. Depending on whether we can get this done before code freeze for
Following today's community call and my re-review of this PR I am convinced that defining this behavior is important. I support the "mixed-mode" approach, which allows some invalid references without causing the entire BackendTLSPolicy to fail.
My reasoning stems from various scenarios, particularly the one I mentioned during the call: when the individual creating the references is different from the one managing the certificates. It would be counterintuitive and potentially damaging for a certificate manager to remove a certificate (to revoke it) without awareness of its reference, leading to a complete disruption of traffic instead of just a portion.
So generally I approve:
/approve
Since we're a couple of weeks out from v1.4.0 code freeze, we also have to decide whether we would consider this, and the spec and test updates it requires, a blocker for GA, as mentioned here. I'm interested in hearing other people's thoughts, especially those who have already implemented BackendTLSPolicy 🤔
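The "mixed-mode" behavior described in this comment can be sketched as follows. This illustrates the idea discussed on the call, not spec text: `evaluate_refs` and its return shape are hypothetical, and the rule that zero usable references means traffic is not served follows the "zero valid CertificateRefs" definition of an invalid policy given earlier in this thread.

```python
from typing import Dict, List, Optional, Tuple


def evaluate_refs(
    results: Dict[str, Optional[str]],
) -> Tuple[List[str], List[str], bool]:
    """Split certificate references into usable and rejected ones.

    `results` maps a reference name to None (usable) or an error reason.
    Returns (usable, rejected, serve_traffic). Traffic is served as long
    as at least one reference is usable; with none left, the backend
    connection is expected to fail rather than fall back to plaintext.
    """
    usable = [name for name, err in results.items() if err is None]
    rejected = [name for name, err in results.items() if err is not None]
    return usable, rejected, len(usable) > 0


# Revocation scenario from the comment: one of two certificates is removed.
before = evaluate_refs({"ca-a": None, "ca-b": None})
after = evaluate_refs({"ca-a": None, "ca-b": "NotFound"})
none_left = evaluate_refs({"ca-a": "NotFound", "ca-b": "NotFound"})
```

In the `after` case the rejected list is what a ResolvedRefs condition would report, while traffic keeps flowing with the remaining certificate; only `none_left` disrupts traffic entirely.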
Signed-off-by: Norwin Schnyder <[email protected]>
@snorwin would you consider requiring a 502 rather than 500 error response?
@candita, from a Gateway API perspective, I would argue that the semantics of an invalid backend reference should be similar to those of a backend reference with an invalid BackendTLSPolicy. In that case, an HTTP 500 error would be appropriate, which is also the behavior adopted by most current implementations. From an Envoy perspective, however, an HTTP 503 might be more suitable, as it matches the case where TLS validation for an upstream connection fails. In general, achieving a specific HTTP status code can be cumbersome, depending on the technology and implementation. Therefore, I would leave it to implementers to decide whether they prefer returning 500, 503, or even 502. @shaneutt if we can get this PR merged by the end of the week, I’ll be able to prepare the spec changes and propose the necessary conformance tests in time (before the code freeze). IMO we should aim to include this in
One small comment about domain-prefixing of Reason, but it's non blocking, so this LGTM.
To be specific as well: I do think that this, along with the API changes involved, is required to move BackendTLSPolicy to Standard. If we don't specify this stuff, it will be hard to get good conformance.
Thanks for the PR. Looks good to me :)
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: kl52752, shaneutt, snorwin
The full list of commands accepted by this bot can be found here.
The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Thanks @snorwin! /lgtm
Sounds good, thank you very much. If you wouldn't mind creating an issue for this last part of the effort so I can sub-task it under #1897 for tracking, I would appreciate it. 🖖
What type of PR is this?
/kind gep
What this PR does / why we need it:
This PR updates the GEP (#1897) to clarify the behavior of BackendTLSPolicy in cases where certificate ref resolving fails or when system-trusted certificates are not defined.
/cc @candita
Which issue(s) this PR fixes:
Fixes #3516
Does this PR introduce a user-facing change?: